Similarity Detection



Esim: EVM Bytecode Similarity Detection Based on Stable-Semantic Graph

Chen, Zhuo, Ji, Gaoqiang, He, Yiling, Wu, Lei, Zhou, Yajin

arXiv.org Artificial Intelligence

Decentralized finance (DeFi) is experiencing rapid expansion. However, prevalent code reuse and limited open-source contributions have introduced significant challenges to the blockchain ecosystem, including plagiarism and the propagation of vulnerable code. Consequently, an effective and accurate similarity detection method for EVM bytecode is urgently needed to identify similar contracts. Traditional binary similarity detection methods are typically based on the instruction stream or the control flow graph (CFG), which have limitations on EVM bytecode due to specific features such as its low-level nature and heavily reused basic blocks. Moreover, the highly diverse Solidity Compiler (Solc) versions further complicate accurate similarity detection. Motivated by these challenges, we propose a novel EVM bytecode representation called the Stable-Semantic Graph (SSG), which captures relationships between "stable instructions" (special instructions identified by our study). We implement a prototype, Esim, which embeds the SSG into matrices for similarity detection using a heterogeneous graph neural network. Esim demonstrates high accuracy in SSG construction, achieving F1-scores of 100% for control flow and 95.16% for data flow, and its similarity detection performance reaches 96.3% AUC, surpassing traditional approaches. Our large-scale study, analyzing 2,675,573 smart contracts on six EVM-compatible chains over a one-year period, also demonstrates that Esim outperforms the SOTA tool Etherscan in vulnerability search.

With the rapid expansion of decentralized finance (DeFi) in the blockchain ecosystem, DeFi projects, which are built on smart contracts running on the Ethereum Virtual Machine (EVM), have attracted substantial investment in recent years, with over $88.8 billion Total Value Locked (TVL) in 2024 [1]. More than 99% of Ethereum contracts are not open source [2]. As a representative case, the Compound v2 protocol [3], one of the top lending protocols, has been widely adopted and forked by numerous DeFi projects. This protocol has a known precision-loss issue that can be exploited when the corresponding market lacks liquidity. Since 2022, a series of attacks (e.g., the Hundred Finance attack [4], the Onyx Protocol attack [5], and the Radiant attack [6]) have been attributed to abuse of the Compound v2 code, resulting in millions of dollars in losses. Consequently, there is an urgent need for an efficient method to detect code reuse in EVM bytecode (binaries), a process also known as EVM bytecode similarity detection. In general, binary similarity detection studies in traditional languages (e.g., C++ [7], [8], [9] and Java [10]) can be divided into two categories: instruction-stream-based and control-flow-graph (CFG)-based.
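As a concrete illustration of the instruction-stream-based category the abstract contrasts with SSG, a minimal baseline compares opcode n-gram histograms with cosine similarity. The opcode sequences below are invented for illustration, and this sketch is a generic baseline, not Esim's method:

```python
from collections import Counter
from math import sqrt

def ngrams(opcodes, n=2):
    """Opcode n-grams as a simple instruction-stream feature."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical EVM opcode sequences for two contracts
c1 = ["PUSH1", "PUSH1", "MSTORE", "CALLVALUE", "DUP1", "ISZERO"]
c2 = ["PUSH1", "PUSH1", "MSTORE", "CALLVALUE", "DUP1", "REVERT"]
score = cosine(ngrams(c1), ngrams(c2))  # 4 of 5 bigrams shared -> 0.8
```

Such a baseline is cheap but brittle: reordered or compiler-rewritten instructions change the histogram even when semantics are preserved, which is exactly the weakness the SSG representation targets.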


Beyond Embeddings: Interpretable Feature Extraction for Binary Code Similarity

Gagnon, Charles E., Ding, Steven H. H., Charland, Philippe, Fung, Benjamin C. M.

arXiv.org Artificial Intelligence

Binary code similarity detection is a core task in reverse engineering. It supports malware analysis and vulnerability discovery by identifying semantically similar code in different contexts. Modern methods have progressed from manually engineered features to vector representations. Hand-crafted statistics (e.g., operation ratios) are interpretable, but shallow and fail to generalize. Embedding-based methods overcome this by learning robust cross-setting representations, but these representations are opaque vectors that prevent rapid verification. They also face a scalability-accuracy trade-off, since high-dimensional nearest-neighbor search requires approximations that reduce precision. Current approaches thus force a compromise between interpretability, generalizability, and scalability. We bridge these gaps using a language model-based agent to conduct structured reasoning analysis of assembly code and generate features such as input/output types, side effects, notable constants, and algorithmic intent. Unlike hand-crafted features, they are richer and adaptive. Unlike embeddings, they are human-readable, maintainable, and directly searchable with inverted or relational indexes. Without any matching training, our method achieves 42% and 62% recall@1 in cross-architecture and cross-optimization tasks, respectively, comparable to embedding methods with training (39% and 34%). Combined with embeddings, it significantly outperforms the state-of-the-art, demonstrating that accuracy, scalability, and interpretability can coexist.
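The claim that readable features are "directly searchable with inverted or relational indexes" can be sketched with a toy inverted index over function features. The feature names, function ids, and Jaccard scoring below are illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

class FeatureIndex:
    """Toy inverted index over human-readable function features,
    an alternative to opaque nearest-neighbor embedding search."""
    def __init__(self):
        self.postings = defaultdict(set)   # feature -> ids of functions having it
        self.features = {}                 # function id -> its feature set

    def add(self, func_id, features):
        self.features[func_id] = set(features)
        for f in features:
            self.postings[f].add(func_id)

    def query(self, features, top_k=3):
        """Rank candidate functions by Jaccard overlap with the query features."""
        q = set(features)
        hit_sets = [self.postings[f] for f in q if f in self.postings]
        candidates = set().union(*hit_sets) if hit_sets else set()
        scored = [(len(q & self.features[c]) / len(q | self.features[c]), c)
                  for c in candidates]
        return sorted(scored, reverse=True)[:top_k]

idx = FeatureIndex()
idx.add("memcpy_variant", {"side_effect:writes_memory", "loop:byte_copy", "args:ptr,ptr,size"})
idx.add("strlen_variant", {"side_effect:none", "loop:byte_scan", "args:ptr"})
hits = idx.query({"loop:byte_copy", "args:ptr,ptr,size"})
```

Because candidates are retrieved by exact posting-list lookups rather than approximate vector search, the result is both verifiable (an analyst can read the matching features) and indexable at scale.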



Gradient-Based Model Fingerprinting for LLM Similarity Detection and Family Classification

Wu, Zehao, Zhao, Yanjie, Wang, Haoyu

arXiv.org Artificial Intelligence

As Large Language Models (LLMs) become integral software components in modern applications, unauthorized model derivations through fine-tuning, merging, and redistribution have emerged as critical software engineering challenges. Unlike traditional software, where clone detection and license compliance are well established, the LLM ecosystem lacks effective mechanisms to detect model lineage and enforce licensing agreements. This gap is particularly problematic when open-source model creators, such as Meta's LLaMA, require derivative works to maintain naming conventions for attribution, yet no technical means exist to verify compliance. To address this, we propose gradient-based model fingerprints. These fingerprints enable two complementary capabilities: direct pairwise similarity assessment between arbitrary models through distance computation, and systematic family classification of unknown models via K-Means clustering with domain-informed centroid initialization using known base models. Experimental evaluation on 58 models, comprising 8 base models and 50 derivatives across five model families (Llama, Qwen, Gemma, Phi, Mistral), demonstrates 94% classification accuracy under our centroid-initialized K-Means clustering. Our work establishes a new paradigm for model similarity detection, bridging traditional software engineering practices with modern LLM distribution and compliance challenges. The proliferation of Large Language Models (LLMs) has fundamentally transformed how we conceptualize and deploy AI-powered software systems. With over one million model repositories on platforms like Hugging Face [1], LLMs have evolved from research artifacts into critical software components powering applications from code generation to intelligent assistants. Zehao Wu and Yanjie Zhao contributed equally to this work. Haoyu Wang is the corresponding author (haoyuwang@hust.edu.cn).
The full name of the authors' affiliation is Hubei Key Laboratory of Distributed System Security, Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology.
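The family-classification capability can be sketched as the assignment step of centroid-initialized K-Means: each unknown fingerprint is assigned to the family of its nearest base-model centroid. The 3-D fingerprints and family labels below are toy stand-ins for the paper's high-dimensional gradient features:

```python
import numpy as np

def classify_by_base_centroids(fingerprints, base_fingerprints, families):
    """Assign each unknown fingerprint to the family of the nearest
    base-model centroid (the assignment step of centroid-initialized K-Means)."""
    centroids = np.asarray(base_fingerprints)   # one centroid per known base model
    X = np.asarray(fingerprints)
    # Pairwise Euclidean distances, shape (n_unknown, n_families)
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return [families[j] for j in d.argmin(axis=1)]

# Hypothetical 3-D fingerprints; two base models define two family centroids
bases = [[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]]
labels = classify_by_base_centroids(
    [[0.5, 0.2, 0.1], [9.8, 10.3, 9.9]], bases, ["llama", "qwen"])
```

Initializing centroids at known base models, rather than randomly, is what lets the clustering double as a classifier: each resulting cluster carries a family label from the start.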


MM-LINS: a Multi-Map LiDAR-Inertial System for Over-Degenerate Environments

Ma, Yongxin, Xu, Jie, Yuan, Shenghai, Zhi, Tian, Yu, Wenlu, Zhou, Jun, Xie, Lihua

arXiv.org Artificial Intelligence

SLAM plays a crucial role in automation tasks, such as warehouse logistics, healthcare robotics, and restaurant delivery. These scenes come with various challenges, including navigating around crowds of people, dealing with flying plastic bags that can temporarily blind sensors, and addressing reduced LiDAR density caused by cooking smoke. Such scenarios can result in over-degeneracy, causing the map to drift. To address this issue, this paper presents, for the first time, a multi-map LiDAR-inertial system (MM-LINS). The front-end employs an iterated error-state Kalman filter for state estimation and introduces a reliable evaluation strategy for degeneracy detection. If over-degeneracy is detected, the active map is stored among the sleeping maps. Subsequently, the system continuously attempts to construct new maps using a dynamic initialization method to ensure successful initialization upon leaving the over-degenerate region. Regarding the back-end, the Scan Context descriptor is utilized to detect inter-map similarity. Upon successful recognition of a sleeping map that shares a common region with the active map, the overlapping trajectory region is utilized to constrain the positional transformation near the edge of the prior map. Building on this, a constraint-enhanced map fusion strategy is proposed to achieve high-precision positioning and mapping results. Experiments have been conducted both on public datasets exhibiting over-degenerate conditions and in real-world environments. These tests demonstrate the effectiveness of MM-LINS in over-degenerate environments. Our code is open-sourced on GitHub.
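The inter-map similarity check builds on the Scan Context descriptor, a polar (ring x sector) matrix compared under circular sector shifts so the match is invariant to yaw. A minimal sketch of that comparison, using toy descriptors rather than real LiDAR data (the 4x6 size and the distance details are illustrative):

```python
import numpy as np

def sc_distance(d1, d2):
    """Toy Scan Context-style comparison: minimum mean column-wise
    cosine distance over all circular sector (yaw) shifts."""
    n_sectors = d1.shape[1]
    best = np.inf
    for s in range(n_sectors):
        shifted = np.roll(d2, s, axis=1)
        num = (d1 * shifted).sum(axis=0)
        den = np.linalg.norm(d1, axis=0) * np.linalg.norm(shifted, axis=0)
        cos = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
        best = min(best, 1.0 - cos.mean())
    return best

# Hypothetical 4-ring x 6-sector descriptors; b is a rotated by 2 sectors,
# so the shift search should recover a near-zero distance
a = np.arange(24, dtype=float).reshape(4, 6) + 1.0
b = np.roll(a, 2, axis=1)
```

Searching over column shifts is what lets the back-end recognize a sleeping map even when the robot re-enters the shared region facing a different direction.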


Assemblage: Automatic Binary Dataset Construction for Machine Learning

Liu, Chang, Saul, Rebecca, Sun, Yihao, Raff, Edward, Fuchs, Maya, Pantano, Townsend Southard, Holt, James, Micinski, Kristopher

arXiv.org Artificial Intelligence

Binary code is pervasive, and binary analysis is a key task in reverse engineering, malware classification, and vulnerability discovery. Unfortunately, while there exist large corpora of malicious binaries, obtaining high-quality corpora of benign binaries for modern systems has proven challenging (e.g., due to licensing issues). Consequently, machine learning based pipelines for binary analysis utilize either costly commercial corpora (e.g., VirusTotal) or open-source binaries (e.g., coreutils) available in limited quantities. To address these issues, we present Assemblage: an extensible cloud-based distributed system that crawls, configures, and builds Windows PE binaries to obtain high-quality binary corpora suitable for training state-of-the-art models in binary analysis. We have run Assemblage on AWS over the past year, producing 890k Windows PE and 428k Linux ELF binaries across 29 configurations. Assemblage is designed to be both reproducible and extensible, enabling users to publish "recipes" for their datasets, and facilitating the extraction of a wide array of features. We evaluated Assemblage by using its data to train modern learning-based pipelines for compiler provenance and binary function similarity. Our results illustrate the practical need for robust corpora of high-quality Windows PE binaries in training modern learning-based binary analyses. Assemblage can be downloaded from https://assemblage-dataset.net/.


Hex2vec -- Context-Aware Embedding H3 Hexagons with OpenStreetMap Tags

Woźniak, Szymon, Szymański, Piotr

arXiv.org Artificial Intelligence

Representation learning of spatial and geographic data is a rapidly developing field which allows for similarity detection between areas and high-quality inference using deep neural networks. Past approaches, however, concentrated on embedding raster imagery (maps, street or satellite photos), mobility data, or road networks. In this paper we propose the first approach to learning vector representations of OpenStreetMap regions with respect to urban functions and land-use in a micro-region grid. We identify a subset of OSM tags related to major characteristics of land-use, building and urban region functions, and types of water, green, or other natural areas. Through manual verification of tagging quality, we selected 36 cities for training region representations. Uber's H3 index was used to divide the cities into hexagons, and OSM tags were aggregated for each hexagon. We propose the hex2vec method based on the Skip-gram model with negative sampling. The resulting vector representations showcase semantic structures of the map characteristics, similar to ones found in vector-based language models. We also present insights from region similarity detection in six Polish cities and propose a region typology obtained through agglomerative clustering.
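The core of the hex2vec method, Skip-gram with negative sampling, can be sketched as an update rule that pushes a region vector toward tags present in its hexagon and away from sampled absent tags. The hexagon ids, tag vocabulary, dimensionality, and learning rate below are illustrative assumptions, not the paper's settings or real H3 indexes:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
# Hypothetical OSM-style tag vocabulary and hexagon ids
tags = ["building=residential", "landuse=retail", "natural=water", "leisure=park"]
hex_ids = ["hex_a", "hex_b"]
H = {h: rng.normal(scale=0.1, size=DIM) for h in hex_ids}  # region vectors
T = {t: rng.normal(scale=0.1, size=DIM) for t in tags}     # tag vectors

def sgns_step(hex_id, pos_tag, neg_tag, lr=0.1):
    """One skip-gram-with-negative-sampling update: raise the predicted
    probability of a tag present in the hexagon, lower it for an absent one."""
    h = H[hex_id]
    for tag, label in ((pos_tag, 1.0), (neg_tag, 0.0)):
        t = T[tag]
        p = 1.0 / (1.0 + np.exp(-h @ t))  # sigmoid of the dot product
        g = lr * (label - p)              # gradient of the logistic loss
        T[tag] = t + g * h
        H[hex_id] = h = h + g * t

for _ in range(200):
    sgns_step("hex_a", "building=residential", "natural=water")
```

After training on real tag co-occurrences, nearby region vectors correspond to hexagons with similar urban function, which is what the paper's agglomerative clustering then exploits.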


The Image Similarity Challenge and data set for detecting image manipulation

#artificialintelligence

We also worked with trained third-party annotators to manually transform a smaller subset of the images to ensure we have even more selections representative of the way a human user would transform images. The annotators used the image manipulation software GIMP to manually alter images in diverse ways that we cannot easily automate, for example handwriting or drawing on the images, or cropping to leave only the part of the image most salient to the human eye. The Image Similarity Challenge invites participants to test their image matching techniques on the Image Similarity data set. More information for researchers, along with the accompanying paper, is available online. For researchers considering attending NeurIPS 2021 in December, we're also pleased to announce that the Image Similarity Challenge has been accepted for the NeurIPS 2021 competition track, where we will announce the winners of this challenge. (The competition is subject to official rules.)


Text Similarity in Vector Space Models: A Comparative Study

Shahmirzadi, Omid, Lugowski, Adam, Younge, Kenneth

arXiv.org Machine Learning

Automatic measurement of semantic text similarity is an important task in natural language processing. In this paper, we evaluate the performance of different vector space models on this task. We address the real-world problem of modeling patent-to-patent similarity and compare TFIDF (and related extensions), topic models (e.g., latent semantic indexing), and neural models (e.g., paragraph vectors). Contrary to expectations, the added computational cost of text embedding methods is justified only when: 1) the target text is condensed; and 2) the similarity comparison is trivial. Otherwise, TFIDF performs surprisingly well: in particular for longer and more technical texts, and for making finer-grained distinctions between nearest neighbors. Unexpectedly, extensions to the TFIDF method, such as adding noun phrases or calculating term weights incrementally, were not helpful in our context.
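The TFIDF baseline the study finds so competitive can be sketched in a few lines; the "patent-like" snippets below are invented for illustration:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TFIDF vectors for a small corpus of tokenized documents."""
    n = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(a, b):
    """Cosine similarity between two sparse TFIDF vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical patent-like snippets, pre-tokenized
docs = [
    "a method for measuring semantic text similarity".split(),
    "a method for measuring image similarity".split(),
    "a battery charging circuit with thermal control".split(),
]
v = tfidf_vectors(docs)
```

Note how the IDF weighting already does useful work here: a term appearing in every document (like "a") gets zero weight, so the unrelated battery snippet scores zero against the text-similarity one while the two related snippets still match.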